Abstract

Consider observation data, comprised of n observation vectors with 
values on a set of attributes. This gives us n points in attribute space. 
Having data structured as a tree, implied by having our observations
embedded in an ultrametric topology, offers great advantage for proximity 
searching. If we have preprocessed data through such an embedding, then 
an observation's nearest neighbor is found in constant computational time, 
i.e. O(1) time. A further powerful approach is discussed in this work: the
inducing of a hierarchy, and hence a tree, in linear computational time,
i.e. O(n) time for n observations. It is with such a basis for proximity
search and best match that we can address the burgeoning problems of 
processing very large, and possibly also very high dimensional, data sets. 

1 Introduction 

Under the heading of "Addressing the big data challenge", the European 7th 
Framework Programme sees the issue thus (see INFSO, 2012): "Recent industry 
reports detail how data volumes are growing at a faster rate than our ability to 
interpret and exploit them for innovative ICT applications, for decision support, 
planning, monitoring, control and interaction. This includes unstructured data 
types such as video, audio, images and free text as well as structured data types 
such as database records, sensor readings and 3D. While each of these types 
requires some specific form of processing and analytics, many of the general 
principles for managing and storing them at extreme scales are common across 
all of them." Analytics tool capability is called for, to address these burgeoning 
issues in the data intensive industries, to support "effective policy making and 
implementation" of public bodies resulting in "significant annual savings from 
Big Data applications", and also to exploit open, linked data "foster the
reuse of public sector information and strengthen other open data activities 
linked to commercial exploitation." The "big data" marketplace is stated to be 
potentially worth approximately USD 600 billion. 

To address the challenges of search and discovery in massive and complex 
data sets and data flows, it is our contention in this work that we must move to 
an appropriate topology, that is, to an appropriate framework, such that computation
is greatly facilitated. Our work is all about empowering those who are involved 
in data analytics, through clustering and related algorithms, to face these new 
challenges. Scalability and interactivity are two of the performance issues that 
follow directly from clustering algorithms, for search, retrieval and discovery, 
that are of linear computational complexity or better (logarithmic, or constant). 

2 Ultrametric Information Spaces 

For high dimensional spaces and also for massive data spaces, it has been shown
in Murtagh (2004) that one can exploit both symmetry and sparsity to great
effect in order to carry out nearest neighbor or best match search and other 
related operations. 

The triangular inequality holds for a metric space: d(x, z) ≤ d(x, y) + d(y, z)
for any triplet of points x, y, z. In addition the properties of symmetry and
positive definiteness are respected. The "strong triangular inequality" or
ultrametric inequality is: d(x, z) ≤ max{d(x, y), d(y, z)} for any triplet x, y, z.
An ultrametric space (Benzécri, 1979; van Rooij, 1978) implies respect for a
range of stringent properties. For example, the triangle formed by any triplet
is necessarily isosceles, with the two large sides equal; or is equilateral.
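
The ultrametric inequality can be checked directly on a small distance matrix. The following is a minimal sketch and is not taken from the cited works; the function name is_ultrametric, the tolerance parameter tol and the two example matrices are illustrative assumptions.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """Return True if the square distance matrix d (list of lists)
    satisfies d[x][z] <= max(d[x][y], d[y][z]) for every triplet."""
    n = len(d)
    for x, y, z in permutations(range(n), 3):
        if d[x][z] > max(d[x][y], d[y][z]) + tol:
            return False
    return True


# Isosceles triangle with the two large sides equal (hypothetical values).
d_ultra = [[0, 1, 2],
           [1, 0, 2],
           [2, 2, 0]]
# Three collinear points at 0, 1, 2 on the real line: a metric,
# but the triplet violates the strong triangular inequality.
d_line = [[0, 1, 2],
          [1, 0, 1],
          [2, 1, 0]]

print(is_ultrametric(d_ultra))  # True
print(is_ultrametric(d_line))   # False
```

The check enumerates all ordered triplets and so is cubic in the number of points; it is meant for verification on small examples, not for large data sets.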

2.1 Computational Costs of Operations in an Ultrametric 
Space 

Given that sparse forms of coding are considered for how complex stimuli are 
represented in the cortex (see Young and Yamane, 1992), the ultrametricity
of such spaces becomes important because of this sparseness of coding. Among 
other implications, this points to the possibility that semantic pattern matching 
is best accomplished through ultrametric computation. 

A convenient data structure for points in an ultrametric space is a
dendrogram. We define a dendrogram as a rooted, labeled, ranked, binary tree
(Murtagh, 1984a). For n observations, with such a definition of tree, there are
precisely n - 1 levels. With each level there is an associated rank 1, 2, ..., n - 1,
with level 1 corresponding to the singletons, and level n - 1 corresponding to
the root node, and also to the cluster that encompasses all observations. With
such a tree, there is an associated distance on the tree, termed the ultrametric 
distance, which is a mapping (of the Cartesian product of the observation set 
with itself) into the positive reals. 
We will use the terms point and observation interchangeably, when the
context allows. That is to say, an observation vector is a point in a space of ambient
dimensionality defined by the cardinality of the attribute set, on which the
observation takes values.

Operations on binary trees are often based on tree traversal between root 
and terminal. See e.g. van Rijsbergen (1979). Hence computational cost of such 
operations is dependent on root-to-terminal(s) path length. The total path 
length of a root-to-terminal traversal varies for each terminal (or point in the 
corresponding ultrametric space). It is simplest to consider path length in terms 
of level or tree node rank (and if it is necessary to avail of path length in terms 
of ultrametric distances, then constant computational time, only, is needed for 
table lookup). A dendrogram's root-to-terminal path length can vary from
close to log_2 n ("close to" because the path length has to be an integer) to n - 1
(Murtagh, 1984b). Let us call this computational cost of a tree traversal O(t).

Most operations that we will now consider make use of a dendrogram data 
structure. Hence the cost of building a dendrogram is important. For the 
problem in general, see Křivánek and Morávek (1984, 1986) and Day (1996).
For O(n^2) implementations of most commonly used hierarchical clustering
algorithms, see Murtagh (1983, 1985). In Section 3 we will address the issue of
efficiently constructing a hierarchical clustering, and hence mapping observed 
data into an ultrametric topology. We will discuss a linear time approach for 
this. 

To place a new point (from an ultrametric space) into a dendrogram, we 
need to find its nearest neighbor. We can do this, in order to write the new 
terminal into the dendrogram, using a root-to-terminal traversal in the current 
version of a dendrogram. This leads to our first proposition. 

Proposition 1: The computational complexity of adding a new terminal to 
a dendrogram is O(t), where t is one traversal from root to terminals in the
dendrogram. 

Proposition 2: The computational complexity of finding the ultrametric
distance between two terminal nodes is twice the length of a traversal from root
to terminals in the dendrogram. Therefore distance is computed in O(t) time.
Informally: we potentially have to traverse from each terminal to the root in 
order to find the common, "parent" node. 
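
Proposition 2 can be illustrated with a short sketch. We assume, for illustration only, that the dendrogram is stored as a parent array together with a height (level) for each node; the names parent, height and ultrametric_distance, and the four-terminal example tree, are hypothetical and are not taken from the cited works.

```python
def ultrametric_distance(parent, height, a, b):
    """Height of the lowest common ancestor of terminals a and b.

    parent[v] is the parent of node v (None for the root);
    height[v] is the level, or ultrametric height, attached to node v.
    """
    if a == b:
        return 0
    ancestors = set()
    v = a
    while v is not None:        # first traversal: from a up to the root
        ancestors.add(v)
        v = parent[v]
    v = b
    while v not in ancestors:   # second traversal: from b up to the common node
        v = parent[v]
    return height[v]


# Hypothetical dendrogram on terminals 0..3:
# node 4 = {0, 1}, node 5 = {4, 2}, node 6 = {5, 3} is the root.
parent = [4, 4, 5, 6, 5, 6, None]
height = [0, 0, 0, 0, 1, 2, 3]

print(ultrametric_distance(parent, height, 0, 1))  # 1
print(ultrametric_distance(parent, height, 0, 2))  # 2
print(ultrametric_distance(parent, height, 1, 3))  # 3
```

Each of the two loops follows at most one terminal-to-root path, which is the "twice the length of a traversal" of the proposition. The example tree is maximally unbalanced, so it also shows the worst case path length of n - 1.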

Proposition 3: The traversal length from dendrogram root to dendrogram
terminals is best case 1, and worst case n - 1. When the dendrogram is optimally
balanced or structured, the traversal length from root to terminals is ⌊log_2 n⌋,
where ⌊.⌋ is the floor, or integer part, function. Hence 1 ≤ O(t) ≤ n - 1, and
for a balanced tree O(t) = log_2 n.

Depending on the agglomerative criterion used, we can approximate the 
balanced or structured dendrogram - and hence favorable case - quite well in 
practice (Murtagh, 1984b). The Ward, or minimum variance, agglomerative 
criterion is shown empirically to be best. 
Proposition 4: Nearest neighbor search in ultrametric space can be carried
out in O(1) or constant time.

This results from the following: the nearest neighbor pair must be in the 
same tightest cluster that contains them both. There is only one candidate 
to check for in a dendrogram. Hence nearest neighbor finding results in firstly 
finding the lowest level cluster containing the given terminal; followed by finding 
the other terminal in this cluster. Two operations are therefore required. 
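
The two operations can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the cited works: we assume that each node stores its parent, that each internal node stores its two children, and that each node stores one representative terminal of its subtree (rep), so that no descent is needed. All names and the example tree are hypothetical.

```python
def nearest_neighbor(parent, children, rep, a):
    """Return one nearest neighbor of terminal a in a binary dendrogram."""
    p = parent[a]                    # operation 1: tightest cluster containing a
    left, right = children[p]
    sibling = right if left == a else left
    # Operation 2: every terminal below the sibling is at the same
    # ultrametric distance from a, so any stored representative will do.
    return rep[sibling]


# Hypothetical dendrogram on terminals 0..3:
# node 4 = {0, 1}, node 5 = {4, 2}, node 6 = {5, 3} is the root.
parent = [4, 4, 5, 6, 5, 6, None]
children = {4: (0, 1), 5: (4, 2), 6: (5, 3)}
rep = [0, 1, 2, 3, 0, 0, 0]          # rep[v] = one terminal in the subtree of v

print(nearest_neighbor(parent, children, rep, 0))  # 1
print(nearest_neighbor(parent, children, rep, 2))  # 0
```

When the sibling is itself a cluster, the nearest neighbor is not unique: all of its terminals are equidistant from the query terminal, and the sketch returns whichever one was stored as representative.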

2.2 Implications 

In Murtagh (2004a, 2004b) we have shown that high dimensional and sparse 
codings tend to be ultrametric. This is an interesting result in its own right. 
However a far more important result is that certain computational operations 
can be carried out very efficiently indeed in space endowed with an ultrametric. 

Chief among these computational operations, we have noted, is that nearest 
neighbor finding can be carried out in (worst case) constant computational 
time, relative to the number of observables considered, n. Depending on the 
structure of the ultrametric space (i.e. if we can build a balanced dendrogram 
data structure), pairwise distance calculation can be carried out in logarithmic 
computational time. 

We have also (Murtagh, 2004a) reviewed approaches to using ultrametric
distances in order to expedite best match, or nearest neighbor, or more generally
proximity search. The usual constructive approach, viz. build a hierarchic
clustering, is simply not computationally feasible in very high dimensional spaces
as are typically found in such fields as speech processing, information retrieval,
or genomics and proteomics. 

Forms of sparse coding are considered to be used in the human or animal
cortex. We raise the interesting question as to whether human or animal thinking
can be computationally efficient precisely because such computation is carried 
out in an ultrametric space. For further elaboration on this, see Murtagh (2012a, 
2012b). 

3 Linear Time and Direct Reading Hierarchical 
Clustering 

In areas such as search, matching, retrieval and general data analysis, massive 
increase in data requires new methods that can cope well with the explosion 
in volume and dimensionality of the available data. The Baire metric, which 
is furthermore an ultrametric, has particular advantages when used to induce 
a hierarchy and in turn to support clustering, matching and other operations.
See Murtagh and Contreras (2012), and Contreras and Murtagh (2012). 

Arising directly out of the Baire distance is an ultrametric tree, which also 
can be seen as a tree that hierarchically clusters data. This presents a number 
of advantages when storing and retrieving data. When the data source is in 
numerical form this ultrametric tree can be used as an index structure making
matching and search, and thus retrieval, much easier. 

The clusters can be associated with hash keys, that is to say, the cluster 
members can be mapped onto "bins" or "buckets".

Another vantage point in this work is precision of measurement. Data
measurement precision can be either used as given or modified in order to enhance
the inherent ultrametric and hence hierarchical properties of the data.

Rather than mapping pairwise relationships onto the reals, as distance does, 
we can alternatively map onto subsets of the power set of, say, attributes of our 
observation set. This is expressed by the generalized ultrametric, which maps
pairwise relationships into a partially ordered set (see Murtagh, 2011). It is also 
current practice as formal concept analysis where the range of the mapping is 
a lattice. 

Relative to other algorithms the Baire-based hierarchical clustering method 
is fast. It is a direct reading algorithm involving one scan of the input data set, 
and is of linear computational complexity. 

Many vantage points are possible, all in the Baire metric framework. The 
following vantage points are discussed in Murtagh and Contreras (2012). 

• Metric that is simultaneously an ultrametric.

• Hierarchy induced through m-adic encoding (m positive integer, e.g. 10). 

• p-Adic (p prime) or m-adic clustering. 

• Hashing of data into bins. 

• Data precision of measurement implies how hierarchical the data is. 

• Generalized ultrametric.

• Lattice-based formal concept analysis. 

• Linear computational time hierarchical clustering. 

3.1 Ultrametric Baire Space and Distance 

A Baire space consists of countably infinite sequences with a metric defined in 
terms of the longest common prefix: the longer the common prefix, the closer 
a pair of sequences. What is of interest to us is this longest common prefix 
metric, which we call the Baire distance (Bradley, 2009; Mirkin and Fishburn, 
1979; Murtagh et al., 2008). 

We begin with the longest common prefixes at issue being digits of precision 
of univariate or scalar values. For example, let us consider two such decimal 
values, x and y, with both measured to some maximum precision. We take as
maximum precision the length of the value with the fewer decimal digits. With
no loss of generality we take x and y to be bounded by 0 and 1. Thus we
consider ordered sets x_k and y_k for k ∈ K. So k = 1 is the first decimal place of
precision; k = 2 is the second decimal place; ...; k = |K| is the |K|th decimal
place. The cardinality of the set K is the precision with which a number, x or
y, is measured.

Consider as examples x_3 = 0.478 and y_3 = 0.472. Start from the first
decimal position. For k = 1, we find x_1 = y_1 = 4. For k = 2, x_2 = y_2 = 7. But
for k = 3, x_3 ≠ y_3.

We now introduce the following distance (case of vectors x and y, with 1 
attribute, hence unidimensional) : 

\[
d_B(x_K, y_K) =
\begin{cases}
1 & \text{if } x_1 \neq y_1 \\
\inf B^{-\nu} & x_\nu = y_\nu, \quad 1 \le \nu \le |K|
\end{cases}
\]

We call this d_B value Baire distance, which is a 1-bounded ultrametric
(Bradley, 2009; Murtagh, 2007) distance, 0 < d_B ≤ 1. When dealing with
binary (boolean) data 2 is the chosen base, B = 2. When working with real 
numbers the base is best defined to be 10, B = 10. With B = 10, for instance, 
it can be seen that the Baire distance is embedded in a 10-way tree which leads 
to a convenient data structure to support search and other operations when we 
have decimal data. As a consequence data can be organized, stored and accessed 
very efficiently and effectively in such a tree. 
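
As an illustration, the Baire distance for scalar values can be computed from the longest common prefix of their digits. The sketch below is ours and makes several assumptions: values in [0, 1) are passed as strings containing a decimal point (to avoid floating point representation issues), the precision is taken as the length of the shorter digit sequence, and exact fractions are returned. The function name baire_distance is hypothetical.

```python
from fractions import Fraction


def baire_distance(x, y, base=10):
    """Baire distance between two values in [0, 1) given as digit strings,
    e.g. "0.478". Returns base ** (-length of longest common prefix)."""
    xd = x.split(".")[1]
    yd = y.split(".")[1]
    precision = min(len(xd), len(yd))   # precision of the shorter value
    common = 0
    for a, b in zip(xd[:precision], yd[:precision]):
        if a != b:
            break
        common += 1
    # common == 0 gives distance 1, i.e. the first digits already differ.
    return Fraction(1, base ** common)


print(baire_distance("0.478", "0.472"))  # 1/100
print(baire_distance("0.478", "0.512"))  # 1
```

For the values 0.478 and 0.472 the first two digits agree and the third differs, so the distance is 10^-2. Adding a third value such as 0.431 gives distances 1/100, 1/10 and 1/10 among the three pairs, an isosceles triangle with the two large sides equal, as the ultrametric inequality requires.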

For B prime, this distance has been studied by Benois-Pineau et al. (2001) 
and by Bradley (2009, 2010), with many further (topological and number
theoretic, leading to algorithmic and computational) insights arising from the p-adic
(where p is prime) framework. See also Anashin and Khrennikov (2009). 

For use of random projections to allow for analysis of multidimensional data 
in the scope of the Baire distance, see Contreras and Murtagh (2012) and also
Murtagh and Contreras (2012). In these works, a range of very large data sets 
are considered, for clustering and for proximity search, in domains that include 
astronomy (photometric and astrometric redshifts), and chemoinformatics. 

3.2 Linear Time, or O(N) Computational Complexity,
Hierarchical Clustering

A point of departure for our work has been the computational objective of
bypassing computationally demanding hierarchical clustering methods (typically
quadratic time, or O(n^2) for n input observation vectors), but also having a
framework that is of great practical importance in terms of the application
domains.

Agglomerative hierarchical clustering algorithms are based on pairwise
distances (or dissimilarities) implying computational time that is O(n^2) where n
is the number of observations. The implementation required to achieve this
is, for most agglomerative criteria, the nearest neighbor chain, together with
the reciprocal nearest neighbors, algorithm (furnishing inversion-free hierarchies
whenever Bruynooghe's reducibility property, see Murtagh (1985), is satisfied
by the cluster criterion). 

This quadratic time requirement is a worst case performance result. It is 
most often the average time also since the pairwise agglomerative algorithm 
is applied directly to the data without any preprocessing speed-ups (such as
preprocessing that facilitates fast nearest neighbor finding). An example of a
linear average time algorithm for (worst case quadratic computational time)
agglomerative hierarchical clustering is in Murtagh (1983).

With the Baire-based hierarchical clustering algorithm, we have an algorithm 
for linear time worst case hierarchical clustering. It can be characterized as a 
divisive rather than an agglomerative algorithm. 

3.3 Grid-Based Clustering Algorithms 

The Baire-based hierarchical clustering algorithm has characteristics that are 
related to grid-based clustering algorithms, and density-based clustering
algorithms, which - often - were developed in order to handle very large data sets.

The main idea here is to use a grid like structure to split the information 
space, separating the dense grid regions from the less dense ones to form groups. 
In general, a typical approach within this category will consist of the following 
steps (Grabusts and Borisov, 2002): 

1. Creating a grid structure, i.e. partitioning the data space into a finite 
number of non-overlapping cells. 

2. Calculating the cell density for each cell. 

3. Sorting of the cells according to their densities. 

4. Identifying cluster centers. 

5. Traversal of neighbor cells. 
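
The first three steps above can be sketched in a few lines. This is a generic illustration and not the procedure of any of the cited works; the function name cell_densities, the use of square cells of side cell_size, and the sample points are assumptions. Steps 4 and 5 depend on the particular algorithm and are omitted.

```python
from collections import Counter


def cell_densities(points, cell_size):
    """Assign 2-D points to square, non-overlapping cells (step 1),
    count the points per cell (step 2) and sort cells by count (step 3)."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return counts.most_common()   # densest cells first


# Hypothetical points in the unit square.
points = [(0.1, 0.1), (0.2, 0.3), (0.15, 0.4), (0.9, 0.9), (0.6, 0.1)]
ranked = cell_densities(points, cell_size=0.5)
print(ranked[0])  # ((0, 0), 3)
```

Cell assignment is a single pass over the points, with the cell index acting as a hash key.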

Additional background on grid-based clustering can be found in the following 
works: Chang and Jin (2002), Gan et al. (2007), Park and Lee (2004), and Xu
and Wunsch (2008). 

Cluster bins, derived from an m-adic tree, provide us with a grid-based 
framework or data structuring. We can read off the cluster bin members from 
an m-adic tree. An m-adic tree requires one scan through the data, and therefore 
this data structure is constructed in linear computational time. 
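
A minimal sketch of this one-scan construction, for m = 10, follows. It is an illustration under our own assumptions rather than the implementation used in the cited works: values in [0, 1) are given as digit strings, each bin is keyed by a digit prefix, and the function name baire_bins and the sample values are hypothetical.

```python
from collections import defaultdict


def baire_bins(values, precision):
    """Map each value to the bins named by its digit prefixes of length
    1, 2, ..., precision. One scan of the data, so linear in len(values)
    for a fixed precision."""
    bins = defaultdict(list)
    for v in values:                       # single scan of the input
        digits = v.split(".")[1][:precision]
        for k in range(1, len(digits) + 1):
            bins[digits[:k]].append(v)     # hash on the length-k prefix
    return bins


values = ["0.3475", "0.3472", "0.3419", "0.8120"]
bins = baire_bins(values, precision=4)
print(bins["34"])   # ['0.3475', '0.3472', '0.3419']
print(bins["347"])  # ['0.3475', '0.3472']
```

The keys of length k are the nodes at depth k of the 10-way tree, and a bin's members are the terminals below that node. A longer key therefore names a tighter cluster.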

In such a preprocessing context, clustering with the Baire distance can be 
seen as a "crude" method for getting clusters. After this we can use more 
traditional techniques to refine the clusters in terms of their membership.
Alternatively (and we have quite extensively compared Baire clustering with, e.g.
k-means, where it compares very well, see Murtagh et al., 2008, and Contreras
and Murtagh, 2012) clustering with the Baire distance can be seen as fully on 
a par with any optimization algorithm for clustering. As optimization, and 
just as one example from the many examples reviewed in this article, the Baire 
approach optimizes an m-adic fit of the data simply by reading the m-adic 
structure directly from the data. 
4 Conclusions



Baire distance is an ultrametric, so we can think of reading off observations as
a tree.

Through data precision of measurement, alone, we can enhance inherent
ultrametricity, or inherent hierarchical properties in the data. 

Clusters in such a Baire-based hierarchy are simple "bins" and assignments 
are determined through a very simple hashing. (E.g. 0.3475 -> bin 3, and ->
bin 34, and -> bin 347, and -> bin 3475.)

As we have observed, certain search-related computational operations can be
carried out very efficiently indeed in space endowed with an ultrametric. Chief
among these computational operations is that nearest neighbor finding can be 
carried out in (worst case) constant computational time. Depending on the 
structure of the ultrametric space (i.e. if we can build a balanced dendrogram 
data structure), pairwise distance calculation can be carried out in logarithmic 
computational time. 

In conclusion we have here a comprehensive approach, founded on ultrametric
topology rather than more traditional metric geometry, in order to address
the burgeoning problems presented by "big data" analytics, i.e. massive data 
sets in potentially very high dimensional spaces. 